java - Regex Pattern.split() with overlapping delimiters


First of all, I know that similar questions have been asked before, for example:

How to split a string, but also keep the delimiters?

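However, I am having trouble splitting a string with Pattern.split() when the pattern is built from a list of delimiters that can sometimes overlap. Here is an example:

The goal is to split a string on a set of known code words surrounded by slashes, keeping both the delimiters (the code words) themselves and the value that follows each one (which may be an empty string).

For this example, the code words are: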

/ABC/
/DEF/
/GHI/

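Based on the thread referenced above, I build the pattern with look-ahead and look-behind so that the string is tokenized into code words and values: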

((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))

Working string:

"123/ABC//DEF/456/GHI/789"

Using split, this tokenizes nicely into:

"123","/ABC/","/DEF/","456","/GHI/","789"

Problem string (note the single slash between "ABC" and "DEF"):

"123/ABC/DEF/456/GHI/789"

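The expectation here is that "DEF/456" is the value following the "/ABC/" code word, because the "DEF/" bit is not actually a code word; it just happens to look like one.

The expected result is: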

"123","/ABC/","DEF/456","/GHI/","789"

The actual result is:

"123","/ABC","/","DEF/","456","/GHI/","789"

As you can see, the slash between "ABC" and "DEF" gets isolated as a token of its own.
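To make the behavior easy to reproduce, here is a minimal sketch that runs the pattern above against both strings (the class name `OverlapProblemDemo` and the `split` helper are only for illustration). Because look-arounds are zero-width and consume nothing, the closing slash of "/ABC/" can also serve as the opening slash of a "/DEF/" match, which is where the extra split points come from.

```java
import java.util.Arrays;
import java.util.List;

final class OverlapProblemDemo {

    // The pattern from the question: split before and after every code word.
    static final String NAIVE_REGEX =
            "((?<=/ABC/)|(?=/ABC/))|((?<=/DEF/)|(?=/DEF/))|((?<=/GHI/)|(?=/GHI/))";

    static List<String> split(String s) {
        return Arrays.asList(s.split(NAIVE_REGEX));
    }

    public static void main(String[] args) {
        // Two separate code words: tokenized as intended.
        System.out.println(split("123/ABC//DEF/456/GHI/789"));
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]

        // Shared slash: (?=/DEF/) also matches at the last slash of /ABC/,
        // and (?<=/DEF/) matches right after "DEF/".
        System.out.println(split("123/ABC/DEF/456/GHI/789"));
        // prints [123, /ABC, /, DEF/, 456, /GHI/, 789]
    }
}
```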

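I tried the solutions from the other thread that use only look-ahead or only look-behind, but they all seem to have the same problem. Any help is appreciated!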


3 answers

# Answer 1

If you can use find instead of split, together with some non-greedy matching, try this:

import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SampleJava {
    static final String[] CODEWORDS = {
        "ABC",
        "DEF",
        "GHI"};

    public static void main(String[] args) {
        String input = "/ABC/DEF/456/GHI/789";
        // Builds the alternation "/(ABC|DEF|GHI)/" from the code word list.
        String codewords = Arrays.stream(CODEWORDS)
                .collect(Collectors.joining("|", "/(", ")/"));
        Pattern p = Pattern.compile(
    /* codewords */ ("(DELIM)"
    /* pre-delim */ + "|(.+?(?=DELIM))"
    /* final bit */ + "|(.+?$)").replace("DELIM", codewords));
        Matcher m = p.matcher(input);
        while (m.find()) {
            System.out.print(m.group(0));
            // Group 1 is set only when the first alternative (a code word) matched.
            if (m.group(1) != null) {
                System.out.print(" ← code word");
            }
            System.out.println();
        }
    }
}

Output:

/ABC/ ← code word
DEF/456
/GHI/ ← code word
789
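The sample input in this answer leaves out the leading "123" from the question. As a rough sketch of how the same find-based idea could return the tokens as a list and be checked against both strings from the question, something like the following should work. The class name `FindTokenizerSketch`, the `tokenize` method, and the use of `Pattern.quote` are my own assumptions, not part of the answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class FindTokenizerSketch {

    // Hypothetical helper: code words are passed without their slashes.
    static List<String> tokenize(String input, String... codewords) {
        // e.g. "/(?:\QABC\E|\QDEF\E|\QGHI\E)/"; quoting guards against
        // regex metacharacters inside a code word.
        String delim = Arrays.stream(codewords)
                .map(Pattern::quote)
                .collect(Collectors.joining("|", "/(?:", ")/"));
        Pattern p = Pattern.compile(
                "(" + delim + ")"            // a code word
                + "|.+?(?=" + delim + ")"    // a value, up to the next code word
                + "|.+$");                   // the trailing value
        List<String> tokens = new ArrayList<>();
        Matcher m = p.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("123/ABC//DEF/456/GHI/789", "ABC", "DEF", "GHI"));
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(tokenize("123/ABC/DEF/456/GHI/789", "ABC", "DEF", "GHI"));
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
    }
}
```

Since find consumes the code word "/ABC/" before moving on, its closing slash is no longer available to start a false "/DEF/" match, which is what makes this approach avoid the overlap.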

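# Answer 2
Use a combination of positive and negative look-arounds:
```
String[] parts = s.split("(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))");
```
Using an alternation inside a single look-ahead/look-behind also simplifies the pattern considerably.

See the live demo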
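For readers without access to the demo, here is a small sketch that applies this regex to both strings from the question (the class name `LookaroundSplitSketch` and the `split` helper are hypothetical). As far as I can tell, the `....` in the first negative look-behind relies on every code word being exactly three letters long: four characters is the length of a false code word minus the slash it shares with the real one.

```java
import java.util.Arrays;
import java.util.List;

final class LookaroundSplitSketch {

    static final String SPLIT_REGEX =
            // Split after a code word, unless a code word ended four characters
            // earlier (then this "code word" reuses that one's closing slash).
            "(?<=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI)/....)"
            // Split before a code word, unless its opening slash is actually
            // the closing slash of the preceding code word.
            + "|(?=/(ABC|DEF|GHI)/)(?<!/(ABC|DEF|GHI))";

    static List<String> split(String s) {
        return Arrays.asList(s.split(SPLIT_REGEX));
    }

    public static void main(String[] args) {
        System.out.println(split("123/ABC//DEF/456/GHI/789"));
        // prints [123, /ABC/, /DEF/, 456, /GHI/, 789]
        System.out.println(split("123/ABC/DEF/456/GHI/789"));
        // prints [123, /ABC/, DEF/456, /GHI/, 789]
    }
}
```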

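# Answer 3

Following TDD principles (red-green-refactor), here is how I would implement this behavior:

Write the specs (red)

I defined a set of unit tests that describe how I understand your "tokenization process". If any test does not match your expectations, feel free to tell me and I will edit my answer accordingly.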

import static org.assertj.core.api.Assertions.assertThat;

import java.util.List;

import org.junit.Test;

public class TokenizerSpec {

    Tokenizer tokenizer = new Tokenizer("/ABC/", "/DEF/", "/GHI/");

    @Test
    public void itShouldTokenizeTwoConsecutiveCodewords() {
        String input = "123/ABC//DEF/456";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/", "/DEF/", "456");
    }

    @Test
    public void itShouldTokenizeMisleadingCodeword() {
        String input = "123/ABC/DEF/456/GHI/789";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/", "DEF/456", "/GHI/", "789");
    }

    @Test
    public void itShouldTokenizeWhenValueContainsSlash() {
        String input = "1/23/ABC/456";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("1/23", "/ABC/", "456");
    }

    @Test
    public void itShouldTokenizeWithoutCodewords() {
        String input = "123/456/789";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123/456/789");
    }

    @Test
    public void itShouldTokenizeWhenEndingWithCodeword() {
        String input = "123/ABC/";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("123", "/ABC/");
    }

    @Test
    public void itShouldTokenizeWhenStartingWithCodeword() {
        String input = "/ABC/123";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("/ABC/", "123");
    }

    @Test
    public void itShouldTokenizeWhenOnlyCodeword() {
        String input = "/ABC//DEF//GHI/";

        List<String> tokens = tokenizer.splitPreservingCodewords(input);

        assertThat(tokens).containsExactly("/ABC/", "/DEF/", "/GHI/");
    }
}

Implement according to the specs (green)

This class makes all of the tests above pass.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public final class Tokenizer {

    private final List<String> codewords;

    public Tokenizer(String... codewords) {
        this.codewords = Arrays.asList(codewords);
    }

    public List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();

        int lastIndex = 0;
        int i = 0;
        while (i < input.length()) {
            final int idx = i;
            Optional<String> codeword = codewords.stream()
                                                 .filter(cw -> input.substring(idx).indexOf(cw) == 0)
                                                 .findFirst();
            if (codeword.isPresent()) {
                if (i > lastIndex) {
                    tokens.add(input.substring(lastIndex, i));
                }
                tokens.add(codeword.get());
                i += codeword.get().length();
                lastIndex = i;
            } else {
                i++;
            }
        }

        if (i > lastIndex) {
            tokens.add(input.substring(lastIndex, i));
        }

        return tokens;
    }
}

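Improve the implementation (refactor)

This is not done yet (I do not have enough time to spend on this answer right now). If you ask, I will be happy to refactor Tokenizer (but later). :-) Or you can do it yourself, since you have the unit tests to guard against regressions.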
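One possible direction for that refactoring, offered as a sketch rather than the author's version: replace the `substring(idx).indexOf(cw) == 0` check with `String.startsWith(String, int)`, which avoids creating a new substring at every position, and move the lookup into a small helper. The class name `RefactoredTokenizer`, the helper `codewordAt`, and the guard against empty code words are my assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RefactoredTokenizer {

    private final List<String> codewords;

    RefactoredTokenizer(String... codewords) {
        this.codewords = Arrays.asList(codewords);
    }

    List<String> splitPreservingCodewords(String input) {
        List<String> tokens = new ArrayList<>();
        int valueStart = 0; // start of the value currently being scanned
        int i = 0;
        while (i < input.length()) {
            String codeword = codewordAt(input, i);
            if (codeword == null) {
                i++;
                continue;
            }
            if (i > valueStart) {
                tokens.add(input.substring(valueStart, i));
            }
            tokens.add(codeword);
            i += codeword.length();
            valueStart = i;
        }
        if (valueStart < input.length()) {
            tokens.add(input.substring(valueStart));
        }
        return tokens;
    }

    // Returns the first code word that starts at the given offset, or null.
    private String codewordAt(String input, int offset) {
        for (String cw : codewords) {
            // Empty code words are skipped; they would never advance the scan.
            if (!cw.isEmpty() && input.startsWith(cw, offset)) {
                return cw;
            }
        }
        return null;
    }
}
```

Consuming the whole code word before continuing the scan is what keeps the closing slash of "/ABC/" from being treated as the start of "/DEF/".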
